This is one page of the R Handbook for Epidemiologists, but is being printed as a stand-alone page.
You can find the complete handbook on Github
The Page title should be succinct. Consider adding a tag with no spaces into the curly brackets, such as below. This can be used for internal links within the handbook. {#title_tag .tabset .tabset-fade}
Keep the title of this section as “Preparation”.
Data preparation steps such as:
This code chunk shows the loading of packages required for the analyses.
pacman::p_load(rio, # File import
here, # File locator
tidyverse, # data management + ggplot2 graphics
aweek, # working with dates
lubridate, # Manipulate dates
incidence, # an option for epicurves of linelist data
stringr, # Search and manipulate character strings
forcats, # working with factors
RColorBrewer) # Color palettes from colorbrewer2.orgTwo example datasets are used in this section:
The dataset is imported using the import() function from the rio package. See the page on importing data for various ways to import data. The linelist and aggregated versions of the data are displayed below.
For most of this document, the linelist dataset will be used. The aggregated counts dataset will be used at the end.
Review the two datasets and notice the differences
Case linelist
# display the linelist data as a table
DT::datatable(linelist, rownames = FALSE, filter="top", options = list(pageLength = 5, scrollX=T) )Case counts aggregated by hospital
You may want to set certain parameters for production of a report, such as the date for which the data is current (the “data date”). In this case, we set this date as 27 July 2013.
Now we can reference the object data_date into the code and have it reference that date.
Verify that each date column is class Date and has an approporiate range of values. This for loop prints a histogram for each column.
# create character vector of column names
DateCols <- as.character(tidyselect::vars_select(names(linelist), matches("date|Date|dt")))
# Produce histogram of each date column
for (Col in DateCols) { # open loop. iterate for each name in vector DateCols
hist(linelist[, Col], # print histogram of the column in linelist dataframe
breaks = 50, # number of breaks for the histogram
xlab = Col) # x-axis label is the name of the column
} # close the loopincidence packageBelow are tabs on making quick epicurves using the incidence package
CAUTION: Epicontacts expects data to be in a “linelist” format of one row per case (not aggregated). If your data is aggregated counts, look to the ggplot epicurves tab.
TIP: The documentation for plotting an incidence object can be accessed by entering ?plot.incidence in your R console.
The incidence package requires 2 steps to plot an epicurve:
incidence())A simple example - an epicurve of daily cases:
# load package
library(incidence)
# create the incidence object using data by day
epi_day <- incidence(linelist$date_onset, interval = "day")
# plot the incidence object
plot(epi_day) #### Change aggregation of cases
The interval defines how the observations are grouped. Available options include all options in the package aweek, including but not limited to:
# Weekly
epi_wk <- incidence(linelist$date_onset, interval = "Monday week")
# MMWR week
epi_MMWRwk <- incidence(linelist$date_onset, interval = "MMWRweek")
# Three weeks
epi_3wk <- incidence(linelist$date_onset, interval = "3 weeks")
# Monthly
epi_month <- incidence(linelist$date_onset, interval = "month")
# Plot the incidence objects
plot(epi_wk)
plot(epi_MMWRwk)
plot(epi_3wk)
plot(epi_month) The first date and last date of the plot can also be specified.
The incidence package enables modifications in the following ways:
plot()scale_x_incidence()ggplot() additionsRead details in the Help files by entering ?scale_x_incidence and ?plot.incidence in the R console. Online vignettes are listed in the resources tab.
Aesthetic themes and other ggplot() syntax can be added with the + symbol, because incidence is using ggplot() in the background.
For a smaller outbreak, it may be important to show boxes around each individual case. Use the argument show_cases = TRUE in the plot() function.
Create the weekly incidence object for a smaller outbreak (only cases from Central Hospital)
# create filtered dataset for Central Hospital
central_data <- linelist %>%
filter(hospital == "Central Hospital")
# create incidence object (weekly)
central_outbreak <- incidence(central_data$date_onset, interval = "Monday week")And now plot the epicurve:
# plot outbreak, showing each case
plot(central_outbreak,
show_cases = T) # show boxes around each individual caseThe same epicurve showing individual cases, but with many other modifications:
# add plot() arguments and ggplot() commands
plot(central_outbreak,
show_cases = T, # show boxes around each individual case
#labels_week = T, # if TRUE: x-axis dates display as weeks ("YYYY-Www")
color = "lightblue", # color inside boxes
border = "darkblue", # border color of boxes
alpha = 0.5, # transparency
xlab = "Date of onset", # x-axis label
ylab = "Weekly reported cases", # y-axis label
n_breaks = 7)+ # to override auto number of x-axis breaks
# ggplot() commands added to the plot
ggtitle("Weekly case incidence at Central Hospital")+ # add title separately
scale_x_date(date_labels = "%d %b")+ # adjust how dates are displayed
scale_y_continuous(breaks = seq(0, 30, 5))+ # adjust y-axis intervals
theme_minimal()+ # simplify background
theme(axis.title = element_text(size = 12, face = "bold"),
plot.caption = element_text(face = "italic", hjust = 0))+
labs(#x = "",
#y = "",
title = "Weekly case incidence at Central Hospital",
#subtitle = "",
caption = stringr::str_glue("Source: Linelist data from: {data_date}
{nrow(linelist %>% filter(is.na(date_onset)))} cases are missing date of onset and not shown"))Note in the above that the grey gridlines do not align with the case boxes. This is because the gridlines show the 1st of a month, whereas the case boxes are by week. Weeks may overlap the start of months. To align them, include `labels_week = T in plot() and do not specify a scale_x_date().
TIP: While the data may be grouped by day or week, you can quickly reduce the number of labels appearing on the x-axis using scale_x_incidence() and specifying the number of axis breaks.
DANGER: Be cautious setting the y-axis scale breaks (e.g. 0 to 30 by 5: seq(0, 30, 5)). Static numbers can cut-off your data if the data changes!.
To enforce a x-axis label for every bin (daily, weekly, etc.), set the n_breaks = argument in plot() to the number of rows in your incidence object: n_breaks = nrow(epi_wk)
scale_x_date(date_breaks = , date_labels = ) discuss. TO DO
To color the cases by group, provide the column to the groups = argument in the incidence() command. In the example below the cases are colored by their generation in the overall transmission chain.
Note the incidence() argument na_as_group = which if FALSE will prevent missing values (NA) from being categorized as their own group.
# Create epiweek object, with counts grouped by gender
gen_outbreak <- incidence(linelist$date_onset,
interval = "week",
groups = linelist$generation,
na_as_group = FALSE) # Missing values not assigned their own group
# plot the epicurve
plot(gen_outbreak) To change the legend, use ggplot() commands such as theme(legend.position = "top", legend.direction = "horizontal", legend.title = element_blank()). See the page of ggplot() tips for more details.
To change the colors:
Change the color palette to one of the default base R palettes using the argument col_pal in plot() (do not put the name of the palette in quotes).
Other palettes include TO DO add page with palette names… To DO
To specify colors manually, provide the name of the color or a character vector of colors to the argument color =. Note to function properly the number of colors must equal the number of groups (be aware of missing values as a group)
hosp_outbreak <- incidence(linelist$date_onset,
interval = "week",
groups = linelist$hospital,
na_as_group = FALSE) # Missing values not assigned their own group
# default colors
plot(hosp_outbreak)
# manual colors
plot(hosp_outbreak, color = c("darkgreen", "darkblue", "purple", "grey", "yellow", "orange"))To plot the epicurve of a subset of data, filter the data first and then use the subset in the incidence() command. The example below uses data filtered to show only female cases.
# filter the dataset
female_data <- filter(linelist, gender == "f")
# create incidence object using female-only data
female_outbreak <- incidence(female_data$date_onset, interval = "week")
# plot
plot(female_outbreak, color = "orange", border = "black", ylab = "Weekly case incidence")To facet the plot by a variable (make “small multiples”), see the tab on epicurves with ggplot()
ggplot2Below are tabs on using “ggplot2” package
# Daily case counts
###################
plot_daily <- ggplot(linelist, aes(x = date_onset)) +
# stacked bars, bined by day (1 days)
stat_bin(binwidth = 1, position="stack")
print(plot_daily)
# Weekly case counts
###################
plot_weekly <- ggplot(linelist, aes(x = date_onset)) +
# stacked bars, bined by week (7 days)
stat_bin(binwidth = 7, position="stack", fill = "brown")
print(plot_weekly)# Preparation
#############
# Create epiweek variable. Factor argument automatically includes all weeks in span. Numeric shows just the week number.
linelist$epiweek <- aweek::date2week(linelist$date_onset, factor = TRUE, numeric = TRUE)
# Calculate maximum number of cases in an epiweek, to get the maximum y-axis height (also helps with uniformity in multiple plots)
ymax <- max(summary(factor(linelist$epiweek), maxsum = length(linelist$epiweek)))
# Weekly case counts
###################
plot_weekly <- ggplot(linelist, aes(x = date_onset)) +
# stacked bars, bined by week (7 days)
stat_bin(binwidth = 7, position = "stack", fill = "grey", color = "black") +
# X-axis 21-day labels
scale_x_date(
# Sets date label breaks as every 3 weeks from Monday before the first case
breaks = function(x) seq.Date(from = min(linelist$date_onset, na.rm = T), to = max(linelist$date_onset, na.rm=T), by = "3 weeks"),
# axis limits determined by max/min + buffer
limits = c((min(linelist$date_onset, na.rm = T) - 8), (max(linelist$date_onset, na.rm = T) + 8)),
# displays as date number, then abbreviated month (e.g. 12 Oct)
date_labels = "%d-%b",
# sets origin at (0,0)
expand = c(0,0)) +
# Y-axis breaks every 5 cases
scale_y_continuous(breaks = seq(0, ymax, 25),
limits = c(0, ymax),
expand = c(0, 0)) +
# Theme specifications (axis, text, etc.)
theme(# title
plot.title = element_text(size=20, hjust= 0, face="bold"), # title size, font, bold
# axes
axis.text.x = element_text(angle=90, vjust=0.5, hjust=1),
axis.text = element_text(size=12),
axis.title = element_text(size=14, face="bold"),
axis.line = element_line(colour="black"),
# background
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
# caption (italics, on right side)
plot.caption = element_text(hjust = 0, face = "italic")
) +
guides(fill = guide_legend(reverse = TRUE, # Orders Non-active zones at end of legend
override.aes = list(size = 0.2),
ncol = 2)) + # Number of legend columns
labs(x = "Week of illness onset",
y = "Number of cases",
subtitle = "subtitle here")
ggtitle("Epidemic curve")
## $title
## [1] "Epidemic curve"
##
## attr(,"class")
## [1] "labels"
# print
print(plot_weekly)Colored by a category
# Setup
########
# Two known classes (select colors from colorbrewer2.org)
colors_overall = c("#d95f02", #
"#1b9e77",
"#7570b3") #
# Order sex variable by reverse # of cases, so plot stacks with smallest # of cases at top
linelist$gender <- factor(linelist$gender,
levels = levels(fct_rev(fct_infreq(linelist$gender))))
# Calculates maximum yaxis height for uniformity between the two graphs
ymax <- max(summary(factor(linelist$epiweek), maxsum = length(linelist$epiweek)))
# Number missing onset_date and cannot be graphed
missing_onset <- nrow(linelist[is.na(linelist$date_onset),])
# PLOT - BY ONSET DATE
######################
plot_defined_cats <- ggplot(linelist, aes(x = date_onset, fill = gender)) +
# stacked bars, width of 7 days
stat_bin(binwidth = 7, position = "stack") +
# Colors and labels of confirmed/probable
scale_fill_manual(values = rev(colors_overall),
labels = str_to_sentence(levels(factor(linelist$gender)))) +
# X-axis scale labels (not aggregation, just the labels)
scale_x_date(# Sets date label breaks as every week
breaks = function(x) seq.Date(from = min(linelist$date_onset, na.rm = T), to = max(linelist$date_onset, na.rm = T), by = "3 weeks"),
limits = c((min(linelist$date_onset, na.rm=T)), (max(linelist$date_onset, na.rm = T))), # axis limits determined by max/min + buffer
date_labels = "%d-%b", # displays as date # then abbreviated month (e.g. 12 Oct)
expand = c(0, 0)) + # sets origin at (0,0)
# Y-scale in breaks, up to the ymax previously defined
scale_y_continuous(breaks = seq(0, 500, 25), limits = c(0, ymax), expand=c(0, 0)) +
# Themes for axes, titles, background, etc.
theme(plot.title = element_text(size=20, hjust=0.5, face="bold"),
axis.text = element_text(size=12),
axis.title = element_text(size=14, face="bold"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black"),
axis.text.x = element_text(angle=90, vjust=0.5, hjust=1)) +
# Legend specifications
theme(legend.title = element_blank(),
legend.justification = c(0, 1),
legend.position = c(0.09, 0.98),
legend.background = element_blank(),
legend.text = element_text(size = 12)) +
guides(fill = guide_legend(reverse = TRUE, override.aes = list(size = 0.2))) +
# Axis and caption labels
labs(x = "Week of illness onset",
y = "Number of Cases") +
# Title
ggtitle("Cases by week of illness onset")
# print
print(plot_defined_cats)# PARAMETERS
#############
# Maximum y-value for epiweek (this will be larger than necessary because of missing onset dates)
ymax <- max(table(linelist$epiweek))
# Number missing onset_date and cannot be graphed
missing_onset <- nrow(filter(linelist, is.na(date_onset)))
# SETUP - ACTIVE/NON-ACTIVE ZONES
#################################
# List of "active" zones with a case in the date range
active_zones <- unique(linelist$province[which(linelist$date_onset > (data_date - 90))])
active_zones
# Table of active zones and their overall number of cases (for ordering their stacked appearance)
order_table <- linelist %>%
filter(province %in% active_zones) %>%
group_by(province) %>%
summarise(cases = n())
order_table
# Create TRUE/FALSE variable for "active" health zones
linelist$active_zone <- ifelse(linelist$province %in% active_zones, TRUE, FALSE)
# Create list of non-active HZ names for bottom of plot
other_zone_names <- unique(sort(linelist$province[linelist$active_zone == FALSE]))
# Make variable for graph categories, including a level for "non-active" zones
linelist$graph_zone <- factor(case_when(
# Value assignments
# Non-active zones
linelist$active_zone == FALSE ~ "Non-active zones",
# All others are assigned their names, capitalized
TRUE ~ stringr::str_to_title(linelist$province)),
# Order of variable levels
levels = c(
# "Non-active zones" is first level
"Non-active zones",
# Orders active zones by their frequency in linelist, reversed, so most-affected zones are on the BOTTOM of plot
str_to_title(rev(levels(fct_infreq(as.factor(linelist$province[linelist$active_zone == TRUE])))))))
table(linelist$graph_zone, useNA = "ifany")
# COLORS
########
# Number of unique values in graph_zone variable, minus 1 (for non-active, which is added later as grey (#cccccc))
colors_needed <- length(unique(linelist$graph_zone, na.rm=T)) - 1
# List of possible colors (see colorbrewer2.com, qualitative scheme)
colors_linelist = c(#"#cccccc", # first = non-active grey color
"#1b9e77", # turquoise green
"#ff7f00", # orange
"#ffff33", # yellow
"#6a3d9a", # purple
"#b15928", # brown
"#1f78b4", # blue
"#e31a1c", # red,
"#fb9a99", # pink
"#b2df8a", # light green
"#cab2d6", # light purple
"#a6cee3", # light blue
"#fdbf6f", # beige
"#33a02c" # green
)
# Reduce number of colors to only the number needed
colors_linelist <- c("#cccccc", rev(colors_linelist[1:colors_needed]))
# MAKE GRAPH
#############
plot_overall <- ggplot(linelist, aes(x = date_onset, fill = graph_zone)) +
# stacked bars, bined by week (7 days)
stat_bin(binwidth = 7, position = "stack") +
# Fill of bars
scale_fill_manual(values = colors_linelist,
labels = str_to_sentence(levels(factor(linelist$graph_zone)))) +
# X-axis 21-day labels
scale_x_date( # Sets date label breaks as every 3 weeks from Monday before the first case
breaks = function(x) seq.Date(from = min(linelist$date_onset, na.rm = T), to = max(linelist$date_onset, na.rm = T), by = "3 weeks"),
limits = c((min(linelist$date_onset, na.rm = T) - 8), (max(linelist$date_onset, na.rm = T) + 8)), # axis limits determined by max/min + buffer
date_labels = "%d-%b", # displays as date number, then abbreviated month (e.g. 12 Oct)
expand = c(0,0)) + # sets origin at (0,0)
# Y-axis breaks every 5 cases
scale_y_continuous(breaks = seq(0, ymax, 25),
limits = c(0, ymax),
expand = c(0, 0)) +
# Theme specifications (axis, text, etc.)
theme(plot.title = element_text(size = 20, hjust = 0, face = "bold"), # title size, font, bold
axis.text.x = element_text(angle=90, vjust=0.5, hjust=1),
axis.text = element_text(size=12),
axis.title = element_text(size=14, face="bold"),
axis.line = element_line(colour="black"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
plot.caption = element_text(hjust = 0, face = "italic")
) +
# Legend specifications
theme(legend.title = element_blank(), # No legend title
legend.position = c(0.20, 0.85), # placement of legend
legend.background = element_blank(), # legend background
legend.text = element_text(size=12)) + # legend text size
guides(fill = guide_legend(reverse = TRUE, # Orders Non-active zones at end of legend
override.aes = list(size = 0.2),
ncol = 2)) + # Number of legend columns
labs(x = "Week of illness onset",
y = "Number of cases",
subtitle = "Health zones with cases in the last 42 days specified by color",
caption = paste0(nrow(linelist),
" confirmed and probable cases, reported as of ", data_date, ". ",
missing_onset, " cases missing date of onset and not shown.",
"\nNon-active zones include: ", str_to_title(toString(unique(linelist$province[linelist$active_zone == FALSE]))))) +
ggtitle("Epidemic curve by active health zones")
# print
plot_overall
#SETUP
#############
# Filter to health zone of interest
zone_data <- linelist
# Number missing onset_date and cannot be graphed
missing_onset <- nrow(filter(linelist, is.na(date_onset)))
# Assign health area groups (individual for HAs of interest, groups others together)
linelist$graph_areas <- factor(case_when(
linelist$province == "Shanghai" ~ "Shanghai",
linelist$province == "Jiangsu" ~ "Jiangsu",
linelist$province == "Zhejiang" ~ "Zhejiang",
TRUE ~ "Other (10)"
),
# Levels part of the factor function assigns order of appearance
levels = c(
"Other (10)",
"Shanghai",
"Jiangsu",
"Zhejiang"
)
)
# checks
table(linelist$graph_areas, useNA = "ifany")
# Color assignments
colors_needed <- length(unique(linelist$graph_areas, na.rm=T)) - 1 # number of colors needed
# list of colors
colors_aire = c("#a6cee3",
"#1f78b4",
"#b2df8a",
"#33a02c",
"#fb9a99",
"#e31a1c",
"#fdbf6f",
"#ff7f00",
"#cab2d6",
"#6a3d9a",
"#ffff99",
"#b15928"
)
# Reduce number of colors to only the number needed
colors_aire <- c("#cccccc", rev(colors_aire[1:colors_needed]))
# Plot of province
#####################################
plot <- ggplot(linelist, aes(x = date_onset, fill = graph_areas)) +
stat_bin(binwidth = 7, position="stack") +
scale_fill_manual(values = colors_aire, labels = str_to_sentence(levels(factor(linelist$graph_areas)))) +
scale_x_date(date_breaks = "1 week", date_labels = "%d-%b", limits = c((min(linelist$date_onset, na.rm = T) - 8), (max(linelist$date_onset, na.rm = T) + 8)), expand=c(0,0)) + # I used the date onset variable here so x axes will be the same
scale_y_continuous(breaks = seq(0, 500, 5), limits = c(0, 35), expand = c(0, 0)) +
theme(plot.title = element_text(size = 20, hjust = 0.5, face = "bold"),
plot.caption = element_text(hjust = 0, face = "italic"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black"),
axis.text = element_text(size = 12),
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1),
axis.title = element_text(size = 14, face = "bold"),
legend.title = element_blank(),
legend.justification = c(0,1),
legend.position = c(0.05, 1),
legend.background = element_blank(),
legend.text = element_text(size = 12)) +
guides(fill = guide_legend(reverse = TRUE, override.aes = list(size = 0.2), ncol = 4)) +
labs(x="Week of illness onset",
y="Number of cases",
subtitle = "",
caption = paste0(nrow(zone_data), " confirmed and probable cases, as of ", data_date, ". \n", missing_onset, " cases excluded due to missing date of onset.")) +
ggtitle("Cases of influenza, by province")
plot# Define the waves
##################
# zone_data <- filter(linelist, zone_de_sante == "mabalako")
#
# zone_data$wave <- case_when(
# zone_data$date_onset >= as.Date("2018-03-01") &
# zone_data$date_onset < as.Date("2018-10-25") ~ "Wave 1",
#
# zone_data$date_onset >= as.Date("2018-10-25") &
# zone_data$date_onset < as.Date("2019-02-01") ~ "Wave 2",
#
# zone_data$date_onset >= as.Date("2019-02-01") &
# zone_data$date_onset < as.Date("2019-09-15") ~ "Wave 3",
#
# zone_data$date_onset >= as.Date("2019-09-15") ~ "Wave 4",
#
# TRUE ~ NA_character_
# )
#
# table(is.na(zone_data$date_onset))
# table(zone_data$wave, useNA = "always")
#
#
# # Color assignments
# colors_needed <- length(unique(zone_data$wave, na.rm=T)) # number of colors needed
#
# # list of colors
# colors_aire = c("#a6cee3",
# "#1f78b4",
# "#b2df8a",
# "#33a02c",
# "#fb9a99",
# "#e31a1c",
# "#fdbf6f",
# "#ff7f00",
# "#cab2d6",
# "#6a3d9a",
# "#ffff99",
# "#b15928"
# )
#
# # Reduce number of colors to only the number needed
# colors_aire <- c(rev(colors_aire[1:colors_needed]))
#
#
# # Plot of health zone colored by wave
# #####################################
# plot_Mabalako <- ggplot(zone_data, aes(x = date_onset, fill = wave)) +
#
# stat_bin(binwidth = 7, position = "stack") +
#
# scale_fill_manual(values = rev(colors_aire), labels = str_to_sentence(levels(factor(zone_data$wave)))) +
#
# scale_x_date(date_breaks = "21 days", date_labels = "%d-%b",
# limits = c((min(zone_data$date_onset, na.rm = T) - 8), (max(zone_data$date_report, na.rm = T) + 8)), expand = c(0,0)) +
#
# scale_y_continuous(breaks = seq(0, 500, 5), limits = c(0, 35), expand = c(0, 0)) +
#
# theme(text = element_text(family = "Segoe Condensed"),
# axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1),
# axis.text = element_text(size = 12),
# axis.title = element_text(size = 14, face = "bold"),
# axis.line = element_line(colour = "black"),
#
# plot.title = element_text(size = 20, hjust = 0.5, face = "bold"),
# plot.caption = element_text(hjust = 0, face = "italic"),
#
# panel.grid.major = element_blank(),
# panel.grid.minor = element_blank(),
# panel.background = element_blank(),
#
# legend.title = element_blank(),
# legend.justification = c(0,1),
# legend.position = c(0.75, 0.98),
# legend.background = element_blank(),
# legend.text = element_text(size=12)) +
#
# guides(fill = guide_legend(reverse = TRUE, override.aes = list(size = 0.2), ncol = 1)) +
#
# labs(x="Week of illness onset",
# y="Number of cases",
# subtitle = "",
# caption = paste0(nrow(zone_data), " confirmed and probable cases, as of ", data_date, ". \n", missing_onset, " cases excluded due to missing date of onset and 16 excluded due to uncertain health zone of report.")) +
#
# ggtitle("Four waves of EVD in Mabalako health zone")
#
# plot_Mabalako
#
#
# # Produce table describing each wave
# ####################################
# table <- zone_data %>%
# select("aire_de_sante", "wave", "community_death", "date_onset", "cte_date", "epicasedef", "community_death", "contact_registered", "contact_surveilled") %>%
# group_by(wave) %>%
# summarise(first_onset = min(date_onset, na.rm = T),
# last_admission = max(cte_date, na.rm = T),
# n = n(),
# confirmed = sum(epicasedef == "confirmed"),
# community_deaths = paste0(sum(community_death == 1),
# " (", round(100*sum(community_death == 1)/confirmed),"%)"),
# reg_contacts = paste0(sum(contact_registered == "yes"),
# " (", round(100*sum(contact_registered == "yes")/confirmed),"%)"),
# surv_contacts = paste0(sum(contact_surveilled == "yes"),
# " (", round(100*sum(contact_surveilled == "yes")/confirmed),"%)"),
# top = paste(toupper(names(sort(table(aire_de_sante),decreasing=TRUE)[1:3])), collapse=", ",
# round(100*(sort(table(aire_de_sante),decreasing=TRUE)[1:3]/confirmed)), "%"),
# health_areas = paste(toupper(unique(aire_de_sante)), collapse=', ')
# )
#
# kable(table)Often you do not have linelist data, but instead daily case counts from facilities, districts, etc. You can plot these in an epidemiological curve, but the code will be slightly different.
This section will utilize the counts_data dataset that was imported earlier, in the data preparation section.
Note: The incidence package does not support aggregate data
As before, we must ensure date variables are correctly classified.
# Create epiweek variable
# aweek weeks are also stored as dates, facilitating better display manipulation
count_data$epiweek <- aweek::date2week(count_data$date_hospitalisation, # use the Date variable
week_start = "Monday", # epiweek begins on Monday
floor_day = TRUE, # only display year and week #
factor = TRUE) # expand to include all possible weeksggplot(data = count_data, aes(x = as.Date(epiweek), y = n_cases, group = hospital, fill = hospital))+
geom_bar(stat = "identity")+
# LABELS for x-axis
scale_x_date(date_breaks = "1 month", # displays by month
date_labels = '%b%d\n%Y')+ #labeled by month with year below
# Choose color palette (uses RColorBrewer package)
scale_fill_brewer(palette = "Pastel1")+
# Theme specifications (axis, text, etc.)
theme(
# title
plot.title = element_text(size=20, hjust= 0, face="bold"), # title size, font, bold
# axes
axis.text.x = element_text(angle=0, vjust=0.5, hjust=1),
axis.text = element_text(size=12),
axis.title = element_text(size=14, face="bold"),
axis.line = element_line(colour="black"),
# background
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
# caption (italics, on right side)
plot.caption = element_text(hjust = 0, face = "italic"))+
# labels
labs(x = "Week of report",
y = "Number of cases",
subtitle = "Cases aggregated by week and shown by hospital",
caption = "Data source: XXXXX")+
ggtitle("Epidemic curve of disease X in fictional location")Although there are fierce discussions about the validity of this within the data visualization community, many supervisors want to see an epicurve or similar chart with a percent overlaid with a second axis.
In ggplot it is very difficult to do this, except for the case where you are showing a line reflecting the proportion of a category shown in the bars below.
This uses the linelist dataset
TODO not complete yet
library(reshape2)
# group the data by week, summarize counts by group (gender)
linelist_week <- linelist %>%
mutate(onset_epiweek = aweek::date2week(date_onset, floor_day = TRUE, factor = TRUE)) %>%
group_by(onset_epiweek) %>%
summarize(num_male = sum(gender == "m"),
num_female = sum(gender == "f"),
pct_male = round(100*(num_male / n())),
med_age = median(as.numeric(age), na.rm=T)
)
# remove pct and melt
linelist_week_melted <- linelist_week %>%
select(-c("pct_male", "med_age")) %>%
melt(id.vars = c("onset_epiweek"))
# merge together (multiple of the same values in week will attach to melted)
linelist_week_melted <- merge(linelist_week_melted,
linelist_week,
by = "onset_epiweek")
second_axis <- ggplot(linelist_week_melted,
aes(x = as.Date(onset_epiweek),
y = value, group = variable,
fill = variable)) +
# bars
geom_bar(stat = "identity")+
# Colors and labels of confirmed/probable
scale_fill_manual(values = c("blue", "red"),
labels = str_to_sentence(levels(factor(linelist_week_melted$variable)))) +
geom_line(mapping = aes(y = pct_male, color = "% male"), size = 0.5) +
scale_color_manual(values = "black")+
scale_y_continuous(sec.axis = sec_axis(~(./sum(linelist_week_melted$value, na.rm = T)*100), name = "name here", breaks = seq(0, 100, 20)))+
# X-axis scale labels (not aggregation, just the labels)
scale_x_date(# Sets date label breaks as every week
breaks = function(x) seq.Date(from = min(linelist$date_onset, na.rm = T), to = max(linelist$date_onset, na.rm = T), by = "1 week"),
limits = c((min(linelist$date_onset, na.rm=T)), (max(linelist$date_onset, na.rm = T))), # axis limits determined by max/min + buffer
date_labels = "%d-%b", # displays as date # then abbreviated month (e.g. 12 Oct)
expand = c(0, 0)) + # sets origin at (0,0)
# Y-scale in breaks, up to the ymax previously defined
scale_y_continuous(breaks = seq(0, 500, 5), limits = c(0, ymax), expand=c(0, 0)) +
# Themes for axes, titles, background, etc.
theme(plot.title = element_text(size=20, hjust=0.5, face="bold"),
axis.text = element_text(size=12),
axis.title = element_text(size=14, face="bold"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black"),
axis.text.x = element_text(angle=90, vjust=0.5, hjust=1)) +
# Legend specifications
theme(legend.title = element_blank(),
legend.justification = c(0, 1),
legend.position = c(0.09, 0.98),
legend.background = element_blank(),
legend.text = element_text(size = 12)) +
guides(fill = guide_legend(reverse = TRUE, override.aes = list(size = 0.2))) +
# Axis and caption labels
labs(x = "Week of illness onset",
y = "Number of Cases",
caption = paste(missing_onset,"cases were missing onset date and are not included in the onset graph")) +
# Title
ggtitle("Cases by week of illness onset")
second_axis
# print
print(plot_defined_cats)This tab should stay with the name “Resources”. Links to other online tutorials or resources.